As a Data Analyst and a baseball fan, I wanted to explore the data surrounding the attendance for my local minor league baseball team, the Ogden Raptors. Specifically, I wanted to focus on two data analysis topics:
Web Scraping
Data Visualization
#import packages
import pandas as pd
from os import path
from bs4 import BeautifulSoup as Soup
import requests
import numpy as np
from pandas import DataFrame
import plotly.express as px
import plotly.graph_objects as go
The first step in this project was scraping that Raptors' schedule and results from their 2022 schedule site. The steps to do this are in the code below.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
att_response = requests.get('https://ogden-raptors.com/sports/bsb/2022/teams/ogdenraptors?view=attendance', headers=HEADERS)
att_soup = Soup(att_response.text)
tables = att_soup.find_all('table')
att_table = tables[15]
att_rows = att_table.find_all('tr')
def parse_att_row(row):
"""
Take in a tr tag and a tag and get the data out of it in the form of a list of
strings.
"""
return [str(x.string) for x in row.find_all(('td','a'))]
list_of_parsed_att_rows = [parse_att_row(row) for row in att_rows[1:]]
att_df = DataFrame(list_of_parsed_att_rows)
Now that we have the dataframe, the next step is cleaning the data and creating new columns to be used for this project and potential future projects.
# Relevant cols
att_df = att_df[[0,1,3,4]]
# Remove empty space befire att
att_df[4] = att_df[4].str.strip()
#remove empty space before result
att_df[3] = att_df[3].str.strip()
#remove unplayed games (either haven't occurred yet or were cancelled)
att_df = att_df[att_df[3] != "-"]
#Create location col
att_df['location'] = np.where(att_df[1].str.contains('at', na = False), 'Away', 'Home')
att_df['location'] = np.where(att_df[1].str.contains('vs.', na = False), 'Away', att_df['location'])
# Clean up opponent col
att_df[1] = att_df[1].str.replace('at\r\n', '')
att_df[1] = att_df[1].str.replace('vs.\r\n', '')
att_df[1] = att_df[1].str.strip()
# Name Cols
att_df.columns = ['Date', 'Opponent', 'Score', 'Attendance', 'Home_Away']
# Add Result
att_df['Result'] = att_df['Score'].str[0]
# Clean Score
att_df['Score'] = att_df['Score'].str.replace('W,', '')
att_df['Score'] = att_df['Score'].str.replace('L,', '')
att_df['Score'] = att_df['Score'].str.strip()
#Win Loss constants
att_df['win_constant'] = np.where(att_df['Result'] == "W",1,0)
att_df['loss_constant'] = np.where(att_df['Result'] == "L",1,0)
#win total
att_df['total_wins'] = att_df['win_constant'].cumsum()
#pregame win total
att_df['pregame_total_wins'] = np.where(att_df['Result'] == "W", att_df['total_wins'] - att_df['win_constant'], att_df['total_wins'])
#loss total
att_df['total_losses'] = att_df['loss_constant'].cumsum()
#pregame loss total
att_df['pregame_total_losses'] = np.where(att_df['Result'] == "L", att_df['total_losses'] - att_df['loss_constant'], att_df['total_losses'])
#game number
att_df['game'] = np.arange(len(att_df))+1
#pregame win %
att_df['pregame_win_%'] = att_df['pregame_total_wins'] / (att_df['game'] - 1)
#Score modification
att_df['Score'] = att_df['Score'].str.replace('-', '|')
#Date Type
att_df['Date'] = pd.to_datetime(att_df['Date'] + ' 2022', infer_datetime_format=True)
#Day of week col
att_df['Day_of_Week'] = att_df['Date'].dt.day_name()
#Attendance Clean
att_df['Attendance'] = att_df['Attendance'].str.replace(',', '')
att_df['Attendance'] = att_df['Attendance'].astype(str).astype(int)
#day_classifier
weekend_list = ['Friday', 'Saturday']
pattern = '|'.join(weekend_list)
att_df['day_classifer'] = np.where(att_df['Day_of_Week'].str.contains(pattern),'weekend','weekday')
#add datestring
att_df['Date_string'] = att_df['Date'].dt.strftime('%Y-%m-%d')
att_df['Date_string'][68] = att_df['Date_string'][68] + ' Game1'
att_df['Date_string'][69] = att_df['Date_string'][69] + ' Game2'
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:14: FutureWarning: The default value of regex will change from True to False in a future version.
att_df[1] = att_df[1].str.replace('vs.\r\n', '')
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:54: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
att_df['Date_string'][68] = att_df['Date_string'][68] + ' Game1'
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\2924845361.py:55: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
att_df['Date_string'][69] = att_df['Date_string'][69] + ' Game2'
att_df[:5]
| Date | Opponent | Score | Attendance | Home_Away | Result | win_constant | loss_constant | total_wins | pregame_total_wins | total_losses | pregame_total_losses | game | pregame_win_% | Day_of_Week | day_classifer | Date_string | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-25 | Grand Junction Jackalopes | 12|7 | 3127 | Home | W | 1 | 0 | 1 | 0 | 0 | 0 | 1 | NaN | Wednesday | weekday | 2022-05-25 |
| 1 | 2022-05-26 | Grand Junction Jackalopes | 12|7 | 3940 | Home | L | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 1.000000 | Thursday | weekday | 2022-05-26 |
| 2 | 2022-05-27 | Grand Junction Jackalopes | 10|7 | 3687 | Home | L | 0 | 1 | 1 | 1 | 2 | 1 | 3 | 0.500000 | Friday | weekend | 2022-05-27 |
| 3 | 2022-05-28 | Grand Junction Jackalopes | 10|4 | 2575 | Home | W | 1 | 0 | 2 | 1 | 2 | 2 | 4 | 0.333333 | Saturday | weekend | 2022-05-28 |
| 4 | 2022-05-31 | Boise Hawks | 9|6 | 2804 | Away | W | 1 | 0 | 3 | 2 | 2 | 2 | 5 | 0.500000 | Tuesday | weekday | 2022-05-31 |
For the scope of this project, I'm only choosing to look at data from the Raptors' home games.
home_df = att_df[(att_df['Home_Away'] == 'Home') &( att_df['Attendance'] >0)].reset_index()
home_df = home_df.drop(['index'], axis=1)
home_df['game'] = np.arange(len(home_df))+1
First let's simply look at the attendance from each of the home Raptors games.
#Explore
fig = px.bar(home_df, x='game', y='Attendance', hover_name = 'Date_string')
fig.update_xaxes(
tickvals = [*range(1,len(home_df),3)],
ticktext = home_df['Date_string'][[*range(0,len(home_df),3)]],
tickangle = 45
)
fig.show()
day_of_week_df = home_df.groupby('Day_of_Week')\
.agg({'Date' : 'count', 'Attendance' : 'mean'})\
.rename(columns = {'Date' : 'Count of Games', 'Attendance' : 'Average Attendance'})\
.reset_index()
day_of_week_df['day_of_week_num'] = np.where(day_of_week_df['Day_of_Week'] == 'Sunday', 1,
np.where(day_of_week_df['Day_of_Week'] == 'Monday', 2,
np.where(day_of_week_df['Day_of_Week'] == 'Tuesday', 3,
np.where(day_of_week_df['Day_of_Week'] == 'Wednesday', 4,
np.where(day_of_week_df['Day_of_Week'] == 'Thursday', 5,
np.where(day_of_week_df['Day_of_Week'] == 'Friday', 6,7))))))
day_of_week_df.sort_values(by=['day_of_week_num'], ignore_index=True)
| Day_of_Week | Count of Games | Average Attendance | day_of_week_num | |
|---|---|---|---|---|
| 0 | Sunday | 5 | 2048.400000 | 1 |
| 1 | Monday | 1 | 1873.000000 | 2 |
| 2 | Tuesday | 6 | 3587.166667 | 3 |
| 3 | Wednesday | 7 | 3303.142857 | 4 |
| 4 | Thursday | 9 | 4328.222222 | 5 |
| 5 | Friday | 9 | 4334.444444 | 6 |
| 6 | Saturday | 7 | 4010.285714 | 7 |
fig = px.bar(day_of_week_df, x='Day_of_Week', y='Average Attendance', hover_data= ['Count of Games'], category_orders = {'Day_of_Week' : ['Sunday', 'Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday']})
fig
I'd like to see if the local Ogden weather on the day of a Raptors game has any correlation with attendance. To do this, I need to scrape weather data from almanac.com. After creating the function to scrape the Almanac weather website, I need to run it just for the dates of the Raptors home games. The steps to do this are in the code below.
def parse_row2(row):
"""
Take in tr tag and get data in list of strings.
"""
return [str(x.string) for x in row]
def scrape_weather_day(day):
weather_response = requests.get(f'https://www.almanac.com/weather/history/zipcode/84404/{day}', headers = HEADERS)
weather_soup = Soup(weather_response.text)
weather_tables = weather_soup.find_all('table')
weather_table = weather_tables[0]
weather_ps = weather_table.find_all('p')
weather_parsed_rows = [parse_row2(row) for row in weather_ps]
weather_df = DataFrame(weather_parsed_rows)
weather_df = weather_df[0].to_frame().T
weather_df = weather_df[[0,1,2,6,10,11]]
weather_df.columns = ['min_temp','mean_temp','max_temp','total_precipitation','mean_wind_speed','max_sustained_wind_speed']
weather_df['day'] = day
return weather_df
# run the weather function for the Raptors game dates
rap_game_list = home_df['Date'].dt.strftime('%Y-%m-%d').unique().tolist()
raptor_weather = pd.concat([scrape_weather_day(x) for x in rap_game_list]).reset_index(drop = True)
raptor_weather['day'] = pd.to_datetime(raptor_weather["day"])
raptor_weather.head()
| min_temp | mean_temp | max_temp | total_precipitation | mean_wind_speed | max_sustained_wind_speed | day | |
|---|---|---|---|---|---|---|---|
| 0 | 45.0 | 62.5 | 79.0 | 0.00 | 7.02 | 13.81 | 2022-05-25 |
| 1 | 46.9 | 72.2 | 91.9 | 0.00 | 6.10 | 11.39 | 2022-05-26 |
| 2 | 55.0 | 77.9 | 91.9 | 0.00 | 7.83 | 14.96 | 2022-05-27 |
| 3 | 53.1 | 65.3 | 86.0 | 0.00 | 9.67 | 20.83 | 2022-05-28 |
| 4 | 51.1 | 67.0 | 80.1 | 0.00 | 4.49 | 10.24 | 2022-06-07 |
As you can see in the data below (far right columns) we can now see weather data (temp, precip, wind) for each game day in Ogden, UT.
home_df = pd.merge(home_df, raptor_weather, left_on = 'Date', right_on = 'day', how = 'left').drop('day', 1)
home_df['max_temp'] = pd.to_numeric(home_df['max_temp'])
home_df['mean_temp'] = pd.to_numeric(home_df['mean_temp'])
home_df['min_temp'] = pd.to_numeric(home_df['min_temp'])
home_df.head()
C:\Users\Mike Gallacher\AppData\Local\Temp\ipykernel_4688\39735578.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
| Date | Opponent | Score | Attendance | Home_Away | Result | win_constant | loss_constant | total_wins | pregame_total_wins | ... | pregame_win_% | Day_of_Week | day_classifer | Date_string | min_temp | mean_temp | max_temp | total_precipitation | mean_wind_speed | max_sustained_wind_speed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-25 | Grand Junction Jackalopes | 12|7 | 3127 | Home | W | 1 | 0 | 1 | 0 | ... | NaN | Wednesday | weekday | 2022-05-25 | 45.0 | 62.5 | 79.0 | 0.00 | 7.02 | 13.81 |
| 1 | 2022-05-26 | Grand Junction Jackalopes | 12|7 | 3940 | Home | L | 0 | 1 | 1 | 1 | ... | 1.000000 | Thursday | weekday | 2022-05-26 | 46.9 | 72.2 | 91.9 | 0.00 | 6.10 | 11.39 |
| 2 | 2022-05-27 | Grand Junction Jackalopes | 10|7 | 3687 | Home | L | 0 | 1 | 1 | 1 | ... | 0.500000 | Friday | weekend | 2022-05-27 | 55.0 | 77.9 | 91.9 | 0.00 | 7.83 | 14.96 |
| 3 | 2022-05-28 | Grand Junction Jackalopes | 10|4 | 2575 | Home | W | 1 | 0 | 2 | 1 | ... | 0.333333 | Saturday | weekend | 2022-05-28 | 53.1 | 65.3 | 86.0 | 0.00 | 9.67 | 20.83 |
| 4 | 2022-06-07 | Northern Colorado Owlz | 5|2 | 3970 | Home | W | 1 | 0 | 8 | 7 | ... | 0.700000 | Tuesday | weekday | 2022-06-07 | 51.1 | 67.0 | 80.1 | 0.00 | 4.49 | 10.24 |
5 rows × 23 columns
fig = px.scatter(home_df, x="max_temp", y="Attendance"\
#, color = "day_classifer",
,trendline="ols"\
)
fig.show()
Because the Raptors only played 44 home games in 2022, it is difficult to see meaningful attendance trends as it relates to 'day of week' and 'temperature'. In addition, the above exploration only looked at those two variables. Other variables that could be interesting to review in connection with attendance in future iterations include: